Introduction

The goal of midterm is to apply some of the methods for supervised and unsupervised analysis to a new dataset. We will work with data characterizing the relationship between wine quality and its analytical characteristics available at UCI ML repository as well as in this course website on canvas. The overall goal will be to use data modeling approaches to understand which wine properties influence the most wine quality as determined by expert evaluation. The output variable in this case assigns wine to discrete categories between 0 (the worst) and 10 (the best), so that this problem can be formulated as classification or regression – here we will stick to the latter and treat/model outcome as continuous variable. For more details please see dataset description available at UCI ML or corresponding file in this course website on canvas. Please note that there is another, much smaller, dataset on UCI ML also characterizing wine in terms of its analytical properties – make sure to use correct URL as shown above, or, to eliminate possibility for ambiguity, the data available on the course website in canvas – the correct dataset contains several thousand observations. For simplicity, clarity and to decrease your dependency on the network reliability and UCI ML availability you are advised to download data made available in this course website to your local folder and work with this local copy.

There are two compilations of data available under the URL shown above as well as in the course website in canvas – separate for red and for white wine – please develop models of wine quality for each of them, investigate attributes deemed important for wine quality in both and determine whether quality of red and white wine is influenced predominantly by the same or different analytical properties (i.e. predictors in these datasets). Lastly, as an exercise in unsupervised learning you will be asked to combine analytical data for red and white wine and describe the structure of the resulting data – whether there are any well defined clusters, what subsets of observations they appear to represent, which attributes seem to affect the most this structure in the data, etc.

Finally, as you will notice, the instructions here are terser than in the previous homework assignments. We expect that you use what you’ve learned in the class to complete the analysis and draw appropriate conclusions based on the data. All approaches that you are expected to apply here have been exercised in the preceeding weekly assignments – please feel free to consult your submissions and/or official solutions as to how they have applied to different datasets. As always, if something appears to be unclear, please ask questions – we may change to private mode those that in our opinion reveal too many details as we see fit.

Sub-problem 1: load and summarize the data (20 points)

Download and read in the data, produce numerical and graphical summaries of the dataset attributes, decide whether they can be used for modeling in untransformed form or any transformations are justified, comment on correlation structure and whether some of the predictors suggest relationship with the outcome.

#Read in the datasets
wine.white <- read.csv("winequality-white.csv", header = TRUE, sep=";")
wine.red <- read.csv("winequality-red.csv", header = TRUE, sep=";")


#Some exploratory analysis to get a feel for the data
head(wine.white)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
dim(wine.white)
## [1] 4898   12
summary(wine.white)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
#It also appears that there are some outliers that we may want to plot (especially for fixed acidity). 


#We are also using the numeric summary to confirm that numerics have been read in properly and not as factors or strings
plotType <- function(domain=wine.white,FUN=hist){
  FUN = match.fun(FUN)
  par(mfrow=c(3,4))
  for (i in 1:dim(domain)[2]){
    FUN(domain[,i],xlab = names(domain)[i],main = "")
  }
}
plotType(domain = wine.white, FUN=plot)

plotType(domain = wine.white, FUN=boxplot)

#From this initial plot, we can see that the following variables have outliers that should be removed: fixed.acidity, citric acid, residual sugar, free sulfur dioxide, total sulfur dioxide and density


par(mfrow=c(1,1))
attach(wine.white)
plot(citric.acid)

plot(citric.acid[1400:1700])

# There is also something strange going on for citric acid between index 1400 -1700. It looks like this may be a data entry error as the values only take on 3 discrete values. lets remove these first

#Removing records with the strange citric acid values.
wwc <- wine.white[c(1:1400,1700:nrow(wine.white)),]

#Next,we will remove the outliers identified
wwc <- subset(wwc, 
              fixed.acidity < 12 &
              citric.acid < 1.5 &
              residual.sugar < 30 & 
              free.sulfur.dioxide < 200 &
              total.sulfur.dioxide <400 & 
              density < 1)


plotType(domain = wwc, FUN=hist)

#From the histogram, we can see that many distributions are right skewed.


pairs(wwc)

#We can also see from the correlation plot, that a few variables are showing signs of heteroskedacity so this will need to be fixed.

#Looking over the previous evidence, and that for the red wine, the 6 variables above were log-transformed. This was kept consistent between both red and white wine as we are combining them later.
wwm.all <- mutate(wwc, fixed.acidity=log(fixed.acidity),
                   volatile.acidity=log(volatile.acidity),
                   chlorides=log(chlorides),
                   free.sulfur.dioxide=log(free.sulfur.dioxide),
                   total.sulfur.dioxide=log(total.sulfur.dioxide),
                   sulphates=log(sulphates)
                   )


# The strength of the correlation is not obvious from the pairs plot, and the standard output from cor() is not so easy to interpret, so using the psych library to display it all together

psych::pairs.panels(wwm.all)

cor.wm <- as.data.frame(as.table(cor(wwm.all)))
subset(cor.wm, abs(Freq) > 0.5 & Var1 != Var2) %>% arrange(-abs(Freq))
##                    Var1                 Var2       Freq
## 1               density       residual.sugar  0.8185135
## 2        residual.sugar              density  0.8185135
## 3               alcohol              density -0.8096686
## 4               density              alcohol -0.8096686
## 5  total.sulfur.dioxide  free.sulfur.dioxide  0.6243473
## 6   free.sulfur.dioxide total.sulfur.dioxide  0.6243473
## 7               density total.sulfur.dioxide  0.5095204
## 8  total.sulfur.dioxide              density  0.5095204
## 9               alcohol            chlorides -0.5054419
## 10            chlorides              alcohol -0.5054419
#Using a cutoff of .5 to show highly correlated variables.
#We can see that the highest correlation is between density and residual sugar, then alcohol and density. Then alcholol with chlorides

# Finally, going to create a test and train set
#smp.size <- floor(0.8 * nrow(wwm.all))

## set the seed to make your partition reproductible
#set.seed(123)
#train.ind <- sample(seq_len(nrow(wwm.all)), size = smp.size)

#wwm <- wwm.all[train.ind, ]
#wwm.test <- wwm.all[-train.ind, ]

wwm <- wwm.all
###############################################
# Repeating above analysis for red wine

head(wine.red)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
dim(wine.red)
## [1] 1599   12
summary(wine.red)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
# Checking for outliers
plotType(domain = wine.red, FUN=plot)

plotType(domain = wine.red, FUN=boxplot)

par(mfrow=c(1,1))
attach(wine.red)
## The following objects are masked from wine.white:
## 
##     alcohol, chlorides, citric.acid, density, fixed.acidity,
##     free.sulfur.dioxide, pH, quality, residual.sugar, sulphates,
##     total.sulfur.dioxide, volatile.acidity
plot(citric.acid)

plot(citric.acid[500:600])

# There is a similar issue one indexs 5-600. Removing these records

rwc <- wine.red[c(1:1400,1700:nrow(wine.red)),]

# removing outliers
rwc <- subset(rwc, 
              volatile.acidity < 1.5 &
              citric.acid < 1 &
              residual.sugar < 15 &
              total.sulfur.dioxide < 250 &
              chlorides < .6 &
              sulphates < 1.8)

#Checking distribution
plotType(domain = rwc, FUN=hist)

pairs(rwc)

#applying same transformations as was done with red wine.
rwm.all <- mutate(rwc, fixed.acidity=log(fixed.acidity),
                   volatile.acidity=log(volatile.acidity),
                   chlorides=log(chlorides),
                   free.sulfur.dioxide=log(free.sulfur.dioxide),
                   total.sulfur.dioxide=log(total.sulfur.dioxide),
                   sulphates=log(sulphates)
                   )



psych::pairs.panels(rwm.all)

cor.rm <- as.data.frame(as.table(cor(rwm.all)))
subset(cor.rm, abs(Freq) > 0.5 & Var1 != Var2) %>% arrange(-abs(Freq))
##                    Var1                 Var2       Freq
## 1  total.sulfur.dioxide  free.sulfur.dioxide  0.7868810
## 2   free.sulfur.dioxide total.sulfur.dioxide  0.7868810
## 3                    pH        fixed.acidity -0.7027256
## 4         fixed.acidity                   pH -0.7027256
## 5           citric.acid        fixed.acidity  0.6852071
## 6         fixed.acidity          citric.acid  0.6852071
## 7               density        fixed.acidity  0.6700530
## 8         fixed.acidity              density  0.6700530
## 9           citric.acid     volatile.acidity -0.5538977
## 10     volatile.acidity          citric.acid -0.5538977
## 11                   pH          citric.acid -0.5402992
## 12          citric.acid                   pH -0.5402992
# The most highly correlated variables we are seeing here are not the same as with the red wine. The highest correlated are total.sulfur.dioxide with free.sulfur dioxide. For both red and white wine, alcohol is the highest correlated variable with the quality 



#smp.size <- floor(0.8 * nrow(rwm.all))

## set the seed to make your partition reproductible
#set.seed(123)
#train.ind <- sample(seq_len(nrow(rwm.all)), size = smp.size)

#rwm <- rwm.all[train.ind, ]
#rwm.test <- rwm.all[-train.ind, ]

rwm <- rwm.all

Sub-problem 2: choose optimal models by exhaustive, forward and backward selection (20 points)

Use regsubsets from library leaps to choose optimal set of variables for modeling wine quality for red and white wine (separately), describe differences and similarities between attributes deemed important in each case.

regSubAllMeths <- function(dat){
  summary.metrics <- NULL
  which.all <- list()
  rs.all <- list()
  for ( my.mthd in c("exhaustive", "backward", "forward", "seqrep") ) {
    rs.res <- regsubsets(quality~.,dat,method=my.mthd,nvmax=11)
    summ.res <- summary(rs.res)
    which.all[[my.mthd]] <- summ.res$which
    #rsAll [[my.mthd]] <- rsRes
    for ( metric.name in c("rsq","rss","adjr2","cp","bic") ) {
      summary.metrics <- rbind(summary.metrics,
        data.frame(method=my.mthd,metric=metric.name,
                  nvars=1:length(summ.res[[metric.name]]),
                  value=summ.res[[metric.name]]))
    }
  }
  list(summary.metrics, which.all)
}

res.all.white <- regSubAllMeths(wwm)
sum.met.white <- data.frame(res.all.white[1])
which.white <- res.all.white[2][[1]]

# Plot of the number of variables against each metric score by method. To help us decide the optimal number of variables
ggplot(sum.met.white,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") +   theme(legend.position="top")

#Retrieve "best" number of subsets for each method/metric

result.white <- sum.met.white %>% 
             group_by(method,metric) %>%
             filter((value == max(value) & metric %in% c("rsq","adjr2")) 
                  | (value == min(value) & metric %in% c("rss","cp","bic")))  
result.white
## Source: local data frame [20 x 4]
## Groups: method, metric [20]
## 
##        method metric nvars         value
##        <fctr> <fctr> <int>         <dbl>
## 1  exhaustive    rsq    11     0.3093220
## 2  exhaustive    rss    11  2428.2856334
## 3  exhaustive  adjr2     9     0.3079285
## 4  exhaustive     cp     8     7.2594258
## 5  exhaustive    bic     8 -1593.9190342
## 6    backward    rsq    11     0.3093220
## 7    backward    rss    11  2428.2856334
## 8    backward  adjr2     9     0.3079285
## 9    backward     cp     8     7.2594258
## 10   backward    bic     8 -1593.9190342
## 11    forward    rsq    11     0.3093220
## 12    forward    rss    11  2428.2856334
## 13    forward  adjr2     9     0.3079285
## 14    forward     cp     8     7.2594258
## 15    forward    bic     8 -1593.9190342
## 16     seqrep    rsq    11     0.3093220
## 17     seqrep    rss    11  2428.2856334
## 18     seqrep  adjr2     9     0.3079285
## 19     seqrep     cp     9     8.0890750
## 20     seqrep    bic     7 -1592.3390335
#From the graph and the summary, we can see that adjusted Rsqare favours 9, cp and BIC favor 8 variables across most models. From the graph we can see that there is a negligible change in RSS between 7 and 8 variables, and in fact we could probably justify using 6 variables.

#Number of variables to select
parmn <- 9
regsub.white <- data.frame(exhaustive = which.white$exhaustive[parmn,], 
                     backward   = which.white$backward[parmn,], 
                     forward    = which.white$forward[parmn,],
                     seqrep     = which.white$seqrep[parmn,])

kable(regsub.white)
exhaustive backward forward seqrep
(Intercept) TRUE TRUE TRUE TRUE
fixed.acidity TRUE TRUE TRUE TRUE
volatile.acidity TRUE TRUE TRUE TRUE
citric.acid FALSE FALSE FALSE FALSE
residual.sugar TRUE TRUE TRUE TRUE
chlorides TRUE TRUE TRUE TRUE
free.sulfur.dioxide TRUE TRUE TRUE TRUE
total.sulfur.dioxide FALSE FALSE FALSE FALSE
density TRUE TRUE TRUE TRUE
pH TRUE TRUE TRUE TRUE
sulphates TRUE TRUE TRUE TRUE
alcohol TRUE TRUE TRUE TRUE
#With 9 variables, all methods produced models using the same variables



w.glm <- glm(quality~ fixed.acidity+volatile.acidity+residual.sugar+chlorides+free.sulfur.dioxide+density+pH+sulphates+alcohol,data=wwm)
cat("White wine: MSE of ",cv.glm(wwm, w.glm, K=10)$delta[1])
## White wine: MSE of  0.5407825
########################################################
# Red wine section

# Repeating above analysis for red whine

res.all.red <- regSubAllMeths(rwm)
sum.met.red <- data.frame(res.all.red[1])
which.red <- res.all.red[2][[1]]

ggplot(sum.met.red,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") +   theme(legend.position="top")

sum.met.red  %>% 
             group_by(method,metric) %>%
             filter((value == max(value) & metric %in% c("rsq","adjr2")) 
                  | (value == min(value) & metric %in% c("rss","cp","bic"))) 
## Source: local data frame [20 x 4]
## Groups: method, metric [20]
## 
##        method metric nvars        value
##        <fctr> <fctr> <int>        <dbl>
## 1  exhaustive    rsq    11    0.3764951
## 2  exhaustive    rss    11  555.7077014
## 3  exhaustive  adjr2     9    0.3723567
## 4  exhaustive     cp     8    7.7392822
## 5  exhaustive    bic     7 -595.8945564
## 6    backward    rsq    11    0.3764951
## 7    backward    rss    11  555.7077014
## 8    backward  adjr2     9    0.3723567
## 9    backward     cp     8    7.7392822
## 10   backward    bic     7 -595.8945564
## 11    forward    rsq    11    0.3764951
## 12    forward    rss    11  555.7077014
## 13    forward  adjr2     9    0.3723567
## 14    forward     cp     8    7.7392822
## 15    forward    bic     7 -595.8945564
## 16     seqrep    rsq    11    0.3764951
## 17     seqrep    rss    11  555.7077014
## 18     seqrep  adjr2     8    0.3720980
## 19     seqrep     cp     8    7.7392822
## 20     seqrep    bic     7 -595.8945564
parmn <- 9
regsub.red <- data.frame(exhaustive = which.red$exhaustive[parmn,], 
                     backward   = which.red$backward[parmn,], 
                     forward    = which.red$forward[parmn,],
                     seqrep     = which.red$seqrep[parmn,])

kable(regsub.red)
exhaustive backward forward seqrep
(Intercept) TRUE TRUE TRUE TRUE
fixed.acidity TRUE TRUE TRUE TRUE
volatile.acidity TRUE TRUE TRUE TRUE
citric.acid TRUE TRUE TRUE TRUE
residual.sugar FALSE FALSE FALSE TRUE
chlorides TRUE TRUE TRUE TRUE
free.sulfur.dioxide TRUE TRUE TRUE TRUE
total.sulfur.dioxide TRUE TRUE TRUE TRUE
density FALSE FALSE FALSE TRUE
pH TRUE TRUE TRUE TRUE
sulphates TRUE TRUE TRUE FALSE
alcohol TRUE TRUE TRUE FALSE
# Although we get the same number of parameters for the models for red and white wine, 
# our models don't agree on which variables to select. This is likely due to the different characteristics
# that contribute to a good red wine vs a good white wine. 

r.glm <- glm(quality~ fixed.acidity+volatile.acidity+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol,data=rwm)


#Smaller MSE than what was observed for white wine
cat("Red wine: MSE of ",cv.glm(rwm, r.glm, K=10)$delta[1])
## Red wine: MSE of  0.40728

Sub-problem 3: optimal model by cross-validation (25 points)

Use cross-validation (or any other resampling strategy of your choice) to estimate test error for models with different numbers of variables. Compare and comment on the number of variables deemed optimal by resampling versus those selected by regsubsets in the previous task. Compare resulting models built separately for red and white wine data.

# To compensate from the lack of predict method for regsubsets
# Taken from ISLR
predict.regsubsets =function (object ,newdata ,id ,...){
  form=as.formula(object$call [[2]])
  mat=model.matrix(form,newdata)
  coefi=coef(object ,id=id)
  xvars=names(coefi)
  mat[,xvars]%*%coefi 
}


# number of folds
k=10

#For reproducibility with sample function
set.seed(1)

folds=sample(1:k,nrow(wwm),replace=TRUE)
cv.errors.white=matrix(NA,k,11, dimnames=list(NULL, paste(1:11)))

# For each fold, we need to loop through the models with upto 11 variables
for(j in 1:k){
  best.fit =  regsubsets(quality~., data=wwm[folds!=j,], nvmax=11)
  for(i in 1:11){
    pred=predict(best.fit,wwm[folds==j,],id=i) 
    #Store the MSE for each model
    cv.errors.white[j,i]=mean((wwm$quality[folds==j]-pred)^2)
  }
}


mean.cv.errors.white <-apply(cv.errors.white ,2,mean)
par(mfrow=c(1,1))
plot(mean.cv.errors.white ,type="b", main = "10-fold CV feature selection (white wine)")

cat("White wine (k-fold): Lowest MSE on", which.min(mean.cv.errors.white) ,"variables")
## White wine (k-fold): Lowest MSE on 9 variables
#Re run the full regsubsets to ge the list of variables used
reg.best=regsubsets(quality~.,data=wwm , nvmax=11)


# Variables selected for the model with lowest MSE
kf.white <- summary(reg.best)$which[which.min(mean.cv.errors.white),]



########################################################
# Red wine section

#Repeat the above analysis 
set.seed(1)

folds=sample(1:k,nrow(rwm),replace=TRUE)
cv.errors.red=matrix(NA,k,11, dimnames=list(NULL, paste(1:11)))

# For each fold, we need to loop through the models with upto 11 variables
for(j in 1:k){
  best.fit =  regsubsets(quality~., data=rwm[folds!=j,], nvmax=11)
  for(i in 1:11){
    pred=predict(best.fit,rwm[folds==j,],id=i) 
    #Store the MSE for each model
    cv.errors.red[j,i]=mean((rwm$quality[folds==j]-pred)^2)
  }
}


mean.cv.errors.red <-apply(cv.errors.red ,2,mean)
par(mfrow=c(1,1))
plot(mean.cv.errors.red ,type="b", main = "10-fold CV feature selection (red wine)")

cat("Red wine (k-fold): Lowest MSE on", which.min(mean.cv.errors.red) ,"variables")
## Red wine (k-fold): Lowest MSE on 7 variables
#Re run the full regsubsets to ge the list of variables used
reg.best=regsubsets(quality~.,data=wwm , nvmax=11)

# Variables selected for the model with lowest MSE
kf.red <- summary(reg.best)$which[which.min(mean.cv.errors.red),]


#Comparing Variables selected from both red and white wines
kable(data.frame(kf.red, kf.white ))
kf.red kf.white
(Intercept) TRUE TRUE
fixed.acidity TRUE TRUE
volatile.acidity TRUE TRUE
citric.acid FALSE FALSE
residual.sugar TRUE TRUE
chlorides FALSE TRUE
free.sulfur.dioxide TRUE TRUE
total.sulfur.dioxide FALSE FALSE
density TRUE TRUE
pH TRUE TRUE
sulphates TRUE TRUE
alcohol FALSE TRUE
#Model has larger MSE than what was observed for red wine using regsubsets
kr.glm <- glm(quality~ fixed.acidity+volatile.acidity+residual.sugar+free.sulfur.dioxide+pH+sulphates+alcohol,data=rwm)
cat("Red wine (k-fold): MSE of ",cv.glm(rwm, kr.glm, K=10)$delta[1])
## Red wine (k-fold): MSE of  0.417774
#Model has larger MSE than what was observed for white wine using regsubsets
kw.glm <- glm(quality~ fixed.acidity+volatile.acidity+residual.sugar+free.sulfur.dioxide+pH+sulphates+alcohol,data=wwm)
cat("White wine(k-fold): MSE of ",cv.glm(wwm, kw.glm, K=10)$delta[1])
## White wine(k-fold): MSE of  0.5532324
#Regsubsets out-performed both models from cross-validation although the MSE was very close.

# For the two wine types, the k-fold cross-validated regsubsets approach has picked the same variables
# which is quite different from what was seen in the normal regsubsets
# We agree on 6 variables across the four models.
# I suspect that the k-fold models are dropping chlorides due to the high correlation we observed with alcohol which is also being included in the model.
kable(data.frame(regsub.red$exhaustive, regsub.white$exhaustive,kf.red , kf.white))
regsub.red.exhaustive regsub.white.exhaustive kf.red kf.white
(Intercept) TRUE TRUE TRUE TRUE
fixed.acidity TRUE TRUE TRUE TRUE
volatile.acidity TRUE TRUE TRUE TRUE
citric.acid TRUE FALSE FALSE FALSE
residual.sugar FALSE TRUE TRUE TRUE
chlorides TRUE TRUE FALSE TRUE
free.sulfur.dioxide TRUE TRUE TRUE TRUE
total.sulfur.dioxide TRUE FALSE FALSE FALSE
density FALSE TRUE TRUE TRUE
pH TRUE TRUE TRUE TRUE
sulphates TRUE TRUE TRUE TRUE
alcohol TRUE TRUE FALSE TRUE

Sub-problem 4: lasso/ridge (25 points)

Use regularized approaches (i.e. lasso and ridge) to model quality of red and white wine (separately). Compare resulting models (in terms of number of variables and their effects) to those selected in the previous two tasks (by regsubsets and resampling), comment on differences and similarities among them.

ridge.lasso.reg <- function(dat, lbl="white wine ridge", a=1){

  # put into matrix form
  r.x <- model.matrix(quality~.,dat)[,-1]
  r.y <- dat[,"quality"]
  
  # Create train and test sets
  set.seed (1)
  r.train=sample(1:nrow(r.x), nrow(r.x)*.8)
  r.test=(-r.train)
  r.y.test=r.y[r.test]
  
  #Use cross-validation to find best value of lambda
  cv.out=cv.glmnet(r.x[r.train ,],r.y[r.train],alpha=a)
  plot(cv.out)
  bestlam=cv.out$lambda.min
  
  #Test model on test set
  ridge.pred=predict(cv.out,s=bestlam ,newx=r.x[r.test,])
  cat(lbl, " Test MSE:",mean((ridge.pred-r.y.test)^2))
  
  #Return coeff for regression model using optimal lambda
  grid=10^seq(10,-2,length=100)
  out=glmnet(r.x,r.y,alpha=1,lambda=grid)
  predict(out,type="coefficients",s=bestlam)[,1]
}

#Ridge regression for white and red wines

ww.ridge = ridge.lasso.reg(wwm, lbl="White Wine (ridge regression)", a=0)

## White Wine (ridge regression)  Test MSE: 0.5511096
rw.ridge = ridge.lasso.reg(rwm, lbl="Red Wine (ridge regression)", a=0)

## Red Wine (ridge regression)  Test MSE: 0.4045869
#Lasso regressions for white and red wines
ww.lasso = ridge.lasso.reg(wwm, lbl="White Wine (lasso regression)", a=1)

## White Wine (lasso regression)  Test MSE: 0.5472413
rw.lasso = ridge.lasso.reg(rwm, lbl="Red Wine (lasso regression)", a=1)

## Red Wine (lasso regression)  Test MSE: 0.4041392
kable(data.frame(reg.red=regsub.red$exhaustive, reg.white=regsub.white$exhaustive,kf.red , kf.white, las.white= ww.lasso>0, las.red=rw.lasso>0))
reg.red reg.white kf.red kf.white las.white las.red
(Intercept) TRUE TRUE TRUE TRUE TRUE TRUE
fixed.acidity TRUE TRUE TRUE TRUE FALSE TRUE
volatile.acidity TRUE TRUE TRUE TRUE FALSE FALSE
citric.acid TRUE FALSE FALSE FALSE FALSE FALSE
residual.sugar FALSE TRUE TRUE TRUE TRUE FALSE
chlorides TRUE TRUE FALSE TRUE FALSE FALSE
free.sulfur.dioxide TRUE TRUE TRUE TRUE TRUE TRUE
total.sulfur.dioxide TRUE FALSE FALSE FALSE FALSE FALSE
density FALSE TRUE TRUE TRUE FALSE FALSE
pH TRUE TRUE TRUE TRUE TRUE FALSE
sulphates TRUE TRUE TRUE TRUE TRUE TRUE
alcohol TRUE TRUE FALSE TRUE TRUE TRUE
#Not displaying the parameters from ridge regression as it contains all variables 
#Both ridge and lasso had very similar Test MSEs to each other, however, in general, they had lower test MSEs than the other methods looked at (esp for red wine), additionally, lasso regression only selected 6 variables. However, these differences could be due to only using .8 % of the data as the training set here. 

Sub-problem 5: PCA (10 points)

Merge data for red and white wine (function rbind allows merging of two matrices/data frames with the same number of columns) and plot data projection to the first two principal components (e.g. biplot or similar plots). Does this representation suggest presence of clustering structure in the data? Does wine type (i.e. red or white) or quality appear to be associated with different regions occupied by observations in the plot? Please remember not to include quality attribute or wine type (red or white) indicator in your merged data, otherwise, apparent association of quality or wine type with PCA layout will be influenced by presence of those indicators in your data.

#Make copies of the datasets
white.pc <- wwm
red.pc <- rwm

#Add a source column
white.pc$source <- "white"
red.pc$source <- "red"

#Join datasets
all.pc <- rbind(white.pc, red.pc)
names(all.pc)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "source"
# Remove Quality and Source
pca <- all.pc[,-c(12:13)]
names(pca)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"
# Calculate Principal components with Scaling 
pca.scale <- prcomp(pca,scale=TRUE)

# variance explained by each principal components
plot(pca.scale)

# Scree plot for the proportion of variance explained by the principal components
pr.var <- pca.scale$sdev^2
prop.var <- pr.var/sum(pr.var)
plot(prop.var, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

#We can see that the first 4 principal components are needed to explain most of the variance


# Biplot of first 2 principal components
biplot(pca.scale, cex=.5)

# We can clearly see 2 discinct clusters, we will first group them by color using the Cols function from ISLR
# We can also see that total.sulfur.dioxide and free.sulfur.dioxide are highly correlated, as are residual.sugar and citric acid
# White wines can be characterised more by, total.sulfur.dioxide and residual.sugar. Whereas reds are characterised more by volatile acid and sulphates.

Cols=function(vec){
  cols=rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}



# The first 2 principal components dont do a good job of grouping based on quality
#this isnt a big surprise as the linear models didnt do a great job of predicting the quality either.
plot(pca.scale$x[,1:2], col=Cols(all.pc$quality), pch=19,
xlab="Z1",ylab="Z2")

# The first two principal components do a great job of identifying wine types
# it shows that there is a lot of variance in the predictors based on wine type
# This is in line with what was observed earlier, when using regsubsets to estimate important
# predictor variables based on wine type.

plot(pca.scale$x[,1:2], col=Cols(all.pc$source), pch=19,
xlab="Z1",ylab="Z2")

Extra 10 points: model wine quality using principal components

Compute PCA representation of the data for one of the wine types (red or white) excluding wine quality attribute (of course!). Use resulting principal components (slot x in the output of prcomp) as new predictors to fit a linear model of wine quality as a function of these predictors. Compare resulting fit (in terms of MSE, r-squared, etc.) to those obtained above. Comment on the differences and similarities between these fits.

l.rwm = rwm[,-12]
head(l.rwm)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1      2.001480       -0.3566749        0.00            1.9 -2.577022
## 2      2.054124       -0.1278334        0.00            2.6 -2.322788
## 3      2.054124       -0.2744368        0.04            2.3 -2.385967
## 4      2.415914       -1.2729657        0.56            1.9 -2.590267
## 5      2.001480       -0.3566749        0.00            1.9 -2.577022
## 6      2.001480       -0.4155154        0.00            1.8 -2.590267
##   free.sulfur.dioxide total.sulfur.dioxide density   pH  sulphates alcohol
## 1            2.397895             3.526361  0.9978 3.51 -0.5798185     9.4
## 2            3.218876             4.204693  0.9968 3.20 -0.3856625     9.8
## 3            2.708050             3.988984  0.9970 3.26 -0.4307829     9.8
## 4            2.833213             4.094345  0.9980 3.16 -0.5447272     9.8
## 5            2.397895             3.526361  0.9978 3.51 -0.5798185     9.4
## 6            2.564949             3.688879  0.9978 3.51 -0.5798185     9.4
pc <- prcomp(l.rwm)

#From earlier, we found that we needed the first 4 principal components to explain most of the variance
betas <- pc$x %*% t(pc$rotation)
 
p=list()

for (j in 1:nrow(l.rwm)){
  for (i in 1:11){
    betas[j,i] = betas[j,i] * l.rwm[j,i]
  }
  p[j]=sum(betas[j,])
}